
Welcome to the Decision Science exercise of your data certification exam!
Here are a few words to describe how the decision_science directory is organised:
# Start by running this cell to see the architecture of the directory
!tree
.
├── README.md
├── boston_crimes.ipynb
├── data
├── data.py
├── db
└── tests

3 directories, 3 files
- The boston_crimes.ipynb notebook that you currently have in front of you is the main document. You will find all the instructions here and, except when explicitly specified, you should provide all your answers in this notebook;
- The data and db folders will be filled in throughout the exercise with .csv datasets and a .sqlite file respectively, for you to run all your analyses;
- The assets folder holds the assets for this exercise;
- The tests folder will contain all the .pickle files that will be saved throughout the exercise with your key findings. Please run all the "Save your results" cells when completing the exercise!

⚠️ Important remark before you dive into the exercise: this notebook is quite long and it is easy to get lost in it, so take full advantage of the collapsible headers and of the table of contents. If you have not yet activated these Jupyter Notebook extensions, you may consider doing so now!
You can use this section to run your imports in a centralised manner throughout the exercise.
# Load the nbresult package to be able to save your results
from nbresult import ChallengeResult
# Useful import for data collection
import sqlite3
# Useful imports for data manipulation and analysis
import numpy as np
import pandas as pd
# Useful imports for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# Useful imports to estimate regression models
import statsmodels.formula.api as smf
During the last municipal campaign in Boston, criminality was a major topic of debate. As citizens have expressed strong expectations from her on that front, the newly-elected mayor of Boston is looking for data-based insights on criminality in the Massachusetts capital. She has commissioned your economics and urbanism consulting firm, The Locomotive, for this study.
Download the boston_crimes.sqlite database from this URL and store it inside the db folder.
# You may directly run this cell to do so
!curl https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/boston_crimes.sqlite > db/boston_crimes.sqlite
This section aims at giving you a first overview of the database.
As you will see, it consists of three tables:

- The incident_reports table has been provided by the Police Department of Boston. Each observation corresponds to a criminal incident that required a police intervention in the municipality of Boston;
- The districts table has been provided by the Urbanism Department of Boston. It gathers geographical information about the various police districts of Boston;
- The indicators table has been shared by the Economics Department of Boston, which keeps track of various indicators of the social and economic activity of Boston neighborhoods. Each observation corresponds to a police district.

More information about the different fields of these three tables can be found in the dedicated README.md file.
🔍 Using your preferred SQL client, connect to the database and browse through it as you wish to get acquainted with the data.
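If you prefer to stay in Python, here is a minimal sketch of browsing the tables with the standard sqlite3 module. The in-memory connection below is a placeholder so the sketch runs anywhere; with the exam database you would connect to db/boston_crimes.sqlite instead.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# With the exam database: conn = sqlite3.connect("db/boston_crimes.sqlite")
conn = sqlite3.connect(":memory:")  # placeholder connection for this sketch
print(list_tables(conn))
```

On the exam database, you would expect the three tables described above.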
📝 Draw the database schema thanks to the schema editor on Kitt.
📝 Download the schema and save it as boston_crimes.xml in the db folder.
Now that you have a good overview of the database, you can kick off the work! You will start with an SQL query to gather the relevant information.
📝 Write the SQL query you need to fetch the data. Save it as a str in the query variable.
# sqlite3 is already imported in the centralised imports section above
conn = sqlite3.connect("db/boston_crimes.sqlite")  # relative path from the decision_science folder
c = conn.cursor()
# Write the query: one row per police district, with the incident count and the socio-economic indicators
query = """
SELECT indicators.CODE, NAME, COUNT(INCIDENT_NUMBER) AS NB_INCIDENTS,
       MEDIAN_AGE, TOTAL_POP, PERC_OF_30_34, PERC_MARRIED_COUPLE_FAMILY, PER_CAPITA_INCOME, PERC_OTHER_STATE_OR_ABROAD,
       PERC_LESS_THAN_HIGH_SCHOOL, PERC_COLLEGE_GRADUATES
FROM indicators
JOIN districts ON indicators.CODE = districts.CODE
JOIN incident_reports ON incident_reports.DISTRICT = districts.CODE
GROUP BY indicators.CODE
"""
# Execute the query and fetch the results; without the GROUP BY clause above,
# the COUNT() aggregate would collapse the whole join into a single row
c.execute(query)
c.fetchall()
[12 rows expected, one tuple per police district]
📝 Store the output of the query in a DataFrame named crimes_df. Display the 5 first rows, as well as the shape of the dataset.
# pandas is already imported in the centralised imports section above
crimes_df = pd.read_sql_query(query, conn)
print("Shape of the DataFrame:", crimes_df.shape)
crimes_df.head()
You can run the following cell to save your results:
ChallengeResult('sql', query=query, data=crimes_df).write()
As mentioned above, we want to investigate the impact of the socio-economic characteristics of the different Boston police districts on the number of incidents that are reported in these areas.
# Adjust the path to wherever you stored boston_crimes_regression.csv
data = pd.read_csv("boston_crimes_regression.csv")
data.head()
| MEDIAN_AGE | TOTAL_POP | PERC_OF_30_34 | PERC_MARRIED_COUPLE_FAMILY | PER_CAPITA_INCOME | PERC_OTHER_STATE_OR_ABROAD | PERC_LESS_THAN_HIGH_SCHOOL | PERC_COLLEGE_GRADUATES | CODE | NB_INCIDENTS | NAME | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.8 | 55297 | 52.8 | 26.4 | 41261 | 8.6 | 6.7 | 10.5 | D14 | 13788 | Brighton |
| 1 | 35.7 | 19890 | 28.2 | 36.4 | 75339 | 3.4 | 7.9 | 8.2 | A15 | 4765 | Charlestown |
| 2 | 33.4 | 126909 | 28.2 | 26.6 | 29767 | 2.4 | 18.0 | 17.1 | C11 | 32875 | Dorchester |
| 3 | 33.5 | 18306 | 32.5 | 35.8 | 80057 | 14.8 | 15.4 | 6.9 | A1 | 26260 | Downtown |
| 4 | 30.6 | 47263 | 31.1 | 30.4 | 31473 | 3.5 | 27.2 | 11.5 | A7 | 9691 | East Boston |
Thanks to the Statsmodels Formula API, we will run the regression described below.
The dependent variable (or target variable) should be the total number of incidents reported in each police district.
We will focus on the following regressors:
📝 Based on these indications, estimate the linear regression model and output its summary in this section of the notebook. Store the estimated model inside a model variable.
import statsmodels.formula.api as smf
import statsmodels.api as sm
# Create target variable y (Series) and feature X (DataFrame) without data leaks!
y = data["NB_INCIDENTS"].copy()
X = data[[
    "MEDIAN_AGE",
    "PERC_OF_30_34",
    "PERC_MARRIED_COUPLE_FAMILY",
    "PERC_OTHER_STATE_OR_ABROAD",  # listed only once: duplicating a regressor makes X singular
    "PERC_COLLEGE_GRADUATES",
]].copy()
X
X
| MEDIAN_AGE | PERC_OF_30_34 | PERC_MARRIED_COUPLE_FAMILY | PERC_OTHER_STATE_OR_ABROAD | PERC_COLLEGE_GRADUATES | |
|---|---|---|---|---|---|
| 0 | 30.8 | 52.8 | 26.4 | 8.6 | 10.5 |
| 1 | 35.7 | 28.2 | 36.4 | 3.4 | 8.2 |
| 2 | 33.4 | 28.2 | 26.6 | 2.4 | 17.1 |
| 3 | 33.5 | 32.5 | 35.8 | 14.8 | 6.9 |
| 4 | 30.6 | 31.1 | 30.4 | 3.5 | 11.5 |
| 5 | 39.4 | 21.1 | 38.4 | 1.9 | 21.3 |
| 6 | 34.8 | 32.5 | 33.7 | 5.5 | 12.1 |
| 7 | 36.7 | 20.9 | 29.8 | 2.3 | 22.9 |
| 8 | 32.5 | 27.8 | 17.8 | 2.9 | 18.9 |
| 9 | 31.9 | 46.1 | 24.7 | 2.4 | 8.4 |
| 10 | 37.1 | 33.9 | 28.3 | 6.2 | 8.5 |
| 11 | 42.8 | 18.1 | 46.2 | 1.9 | 13.2 |
# Note: sm.OLS does not add an intercept automatically; without sm.add_constant(X)
# the summary below reports the uncentered R-squared
model = sm.OLS(y, X).fit()
model.summary()
/Users/jaouadahraoui/.pyenv/versions/3.8.6/envs/lewagon/lib/python3.8/site-packages/scipy/stats/stats.py:1603: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
| Dep. Variable: | NB_INCIDENTS | R-squared (uncentered): | 0.974 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.955 |
| Method: | Least Squares | F-statistic: | 52.40 |
| Date: | Tue, 29 Jun 2021 | Prob (F-statistic): | 2.14e-05 |
| Time: | 12:26:06 | Log-Likelihood: | -115.37 |
| No. Observations: | 12 | AIC: | 240.7 |
| Df Residuals: | 7 | BIC: | 243.2 |
| Df Model: | 5 |  |  |
| Covariance Type: | nonrobust |  |  |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| MEDIAN_AGE | 2563.0215 | 499.845 | 5.128 | 0.001 | 1381.077 | 3744.966 |
| PERC_OF_30_34 | -439.2307 | 161.790 | -2.715 | 0.030 | -821.802 | -56.659 |
| PERC_MARRIED_COUPLE_FAMILY | -2160.0687 | 354.891 | -6.087 | 0.000 | -2999.253 | -1320.884 |
| PERC_OTHER_STATE_OR_ABROAD | 955.8541 | 230.079 | 4.154 | 0.004 | 411.804 | 1499.904 |
| PERC_COLLEGE_GRADUATES | 177.4256 | 372.764 | 0.476 | 0.649 | -704.022 | 1058.873 |
| Omnibus: | 7.823 | Durbin-Watson: | 1.947 |
|---|---|---|---|
| Prob(Omnibus): | 0.020 | Jarque-Bera (JB): | 3.721 |
| Skew: | 1.238 | Prob(JB): | 0.156 |
| Kurtosis: | 4.146 | Cond. No. | 1.64e+17 |
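A condition number on the order of 1e+17, as reported above, signals a numerically singular design matrix, typically caused by duplicated or perfectly collinear regressors. A self-contained toy illustration (synthetic data, not the exam dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
col = rng.normal(size=10)

# Well-conditioned design matrix: two independent columns
X_ok = np.column_stack([rng.normal(size=10), col])
# Rank-deficient design matrix: the same column appears twice
X_dup = np.column_stack([rng.normal(size=10), col, col])

print(np.linalg.cond(X_ok))   # modest
print(np.linalg.cond(X_dup))  # astronomically large, as in the summary above
```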
You can run the following cell to save your results:
ChallengeResult(
'linear_regression_model',
data=data,
model=model
).write()
You will now be asked a series of statistical inference and methodological questions about the regression model estimated above. Don't worry if results do not perfectly fit the "predictions" that we made earlier about the impact of each regressor: the goal was to form an a priori hypothesis, which is validated or invalidated through this empirical analysis.
❓ Is the median age associated with a positive (increasing the number of crimes) or a negative (decreasing the target variable) effect on the number of crime incidents? Simply write your answer as a string below
answer_median_age = '''
Ceteris paribus, a one-unit increase in the median age is associated with an increase of about
2563.02 in the number of incidents, i.e. a positive effect on the number of crimes. This effect
is statistically significant (p-value < 5%).
'''
❓ What is the t-statistic associated with the median age regressor? How is it computed?
💡 Hint: We are looking at a ratio
answer_t_statistic = '''
The t-statistic is the test statistic of the Student significance test of a coefficient.
It is computed as the ratio of the estimated coefficient to its standard error:
t = coef / std err. For the median age, t = 2563.0215 / 499.845 ≈ 5.13.
The associated p-value is the probability of observing such a ratio if the true coefficient
were zero; since it is below 5%, the median age has a statistically significant effect on the
target.
'''
📝 Recompute approximately the t-statistic based on the regression summary.
As it is a ratio $t = \frac{numerator}{denominator}$:
Store your results in the numerator, denominator and t_median_age variables.
numerator = 2563.0215
denominator = 499.845
t_median_age = numerator / denominator
t_median_age
5.127632566095489
❓ What is the p-value associated with the median age regressor? Store it in the pvalue_median_age variable (you may directly copy-paste it from the regression summary).
pvalue_median_age = 0.001
❓ What does this p-value mean for the median age? Is its effect statistically significant at the 95% confidence level? At the 90% confidence level? Simply formulate your answer in the cell below.
answer_p_value = """
The p-value is used in hypothesis tests (such as the Student test) to decide whether to reject a
null hypothesis. It is the probability of making a type-1 error, i.e. of rejecting the null
hypothesis when it is actually true. The smaller the p-value, the lower the risk of wrongly
rejecting the null hypothesis; a 0.05 threshold is commonly used.
In our case, the p-value of the median age is 0.001 < 0.05, so its effect on the target is
statistically significant at the 95% confidence level, and a fortiori at the 90% level.
"""
❓ What are the two regressors whose effect is statistically significant at the 95% confidence level in this regression model? Store the name of the variables as a list of strings in the significant_regressors variable.
significant_regressors = ["MEDIAN_AGE", "PERC_MARRIED_COUPLE_FAMILY"]
❓ Holding all other regressors constant, by how much does the total number of incidents increase or decrease when the share of families with a married couple increases by 1 percentage point in the district? Please formulate a full sentence with statistical rigor!
answer_how_much_increase = """
Ceteris paribus, a 1 percentage point increase in PERC_MARRIED_COUPLE_FAMILY is associated with a
decrease of about 2160.07 in the number of incidents, i.e. a negative effect on the number of
crimes. This effect is statistically significant (p-value < 5%).
"""
You had asked the intern on the team to estimate a linear regression model so as to investigate the socio-economic determinants of crime in Boston. The results above are those that he presented. In the email he sent to you, he added:
You will probably notice the extremely high R-squared score of this model: I think we have an excellent fit and the results are solid 😄
But you have very strong doubts about this regression and you think it is a perfect occasion to give some very important advice to your intern...
❓ What is the main limitation of this (clearly spurious) regression according to you? This observation explains why we are getting a very high R-squared and large standard errors. Please provide your answer in the following Markdown cell.
answer_limitations = """
The sample is tiny: the model is estimated on only 12 observations (one per district), far too
few to draw robust conclusions. The enormous condition number (1.64e+17) also signals severe
multicollinearity in the design matrix, which inflates the standard errors. The R-squared itself
should be taken with a grain of salt: it is an uncentered R-squared (the model was fitted without
an intercept), which is mechanically very high. Finally, a statistician should test the
homoscedasticity assumption before assessing the significance of the coefficients, because under
heteroscedasticity the significance tests are biased.
"""
You can run the following cell to save your results:
import json
answers_inference = {"MEDIAN_AGE": answer_median_age,
"T_STAT":answer_t_statistic,
"P_VALUE": answer_p_value,
"INCREASE": answer_how_much_increase,
"LIMITATIONS": answer_limitations}
with open("tests/answers_inference.json", "w", encoding="utf-8") as f:
json.dump(answers_inference, f, ensure_ascii=False, indent=4)
ChallengeResult(
'linear_regression_analysis',
model=model,
numerator=numerator,
denominator=denominator,
t=t_median_age,
pvalue=pvalue_median_age,
regressors=significant_regressors
).write()
The head of the Police Department of Boston, who read your report for the Mayor's team, was extremely interested in the results. He contacted your consulting firm for an additional presentation, that would focus on the nature of crimes that take place in Boston, the potential time trends that you could identify and/or the heterogeneity of the impact of criminality on the different police districts.
You will start from a fresh dataset, that corresponds more or less to the incident_reports table of the database.
# Run this cell to download the datasets in the data directory
!curl https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/incident_reports.csv > data/incident_reports.csv
!curl https://wagon-public-datasets.s3.amazonaws.com/certification_france_2021_q2/districts.csv > data/districts.csv
# Load the dataset
from data import load_data_viz_data
data = load_data_viz_data()
data[['LAT','LONG']] = data[['LAT','LONG']].astype('float64')
print("Shape of the DataFrame:", data.shape)
data.head()
Shape of the DataFrame: (237221, 9)
| INCIDENT_NUMBER | OFFENSE_CODE_GROUP | SHOOTING | OCCURRED_ON_DATE | LAT | LONG | NAME | LAT_POLICE_STATION | LONG_POLICE_STATION | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | I192068249 | Other | 0 | 2015-08-28 10:20:00 | 42.330119 | -71.084251 | Roxbury | 42.328894 | -71.085359 |
| 1 | I182074094 | Violence and harassment | 0 | 2015-09-14 09:31:00 | 42.315142 | -71.067047 | Roxbury | 42.328894 | -71.085359 |
| 2 | I182054888 | Violence and harassment | 0 | 2015-07-12 15:37:00 | 42.312243 | -71.075499 | Roxbury | 42.328894 | -71.085359 |
| 3 | I182054888 | Other | 0 | 2015-07-12 15:37:00 | 42.312243 | -71.075499 | Roxbury | 42.328894 | -71.085359 |
| 4 | I182054888 | Other | 0 | 2015-07-12 15:37:00 | 42.312243 | -71.075499 | Roxbury | 42.328894 | -71.085359 |
In this section, we are going to answer two specific questions that the head of the Police Department of Boston asked your team, about the characteristics of the incidents that occur in the various districts of the municipality.
First, the head of the Police Department of Boston wants to know what incident category is most often found in each district.
📝 Construct a DataFrame with
💡 Hint: you may need to first define a custom aggregation function.
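One possible custom aggregation is the modal value of a Series, via value_counts().idxmax(). A minimal sketch on a toy frame (the column names match the exercise; the values are made up):

```python
import pandas as pd

# Toy stand-in for the `data` DataFrame
toy = pd.DataFrame({
    "NAME": ["Brighton", "Brighton", "Brighton", "Roxbury", "Roxbury"],
    "OFFENSE_CODE_GROUP": ["Fraud", "Fraud", "Other", "Other", "Other"],
})

def most_common(series):
    """Custom aggregation: the most frequent value of a Series."""
    return series.value_counts().idxmax()

most_common_offense = (
    toy.groupby("NAME")["OFFENSE_CODE_GROUP"]
       .agg(most_common)
       .to_frame("MOST_COMMON_OFFENSE")
)
print(most_common_offense)
```

The same groupby/agg pattern applies directly to the full incidents DataFrame.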
!pip install pandasql
import pandasql as ps
q1 = """SELECT NAME, OFFENSE_CODE_GROUP
FROM data
GROUP BY NAME
ORDER BY COUNT(OFFENSE_CODE_GROUP)"""
print(ps.sqldf(q1, locals()))
    NAME           OFFENSE_CODE_GROUP
0   Charlestown    Larceny and vandalism
1   West Roxbury   Police investigation procedure
2   East Boston    Fraud and law violations
3   Hyde Park      Fraud and law violations
4   Jamaica Plain  Other
5   Brighton       Fraud and law violations
6   South Boston   Fraud and law violations
7   Downtown       Fraud and law violations
8   Mattapan       Violence and harassment
9   South End      Fraud and law violations
10  Dorchester     Fraud and law violations
11  Roxbury        Other
❓ Can you tell what the second most common offense in the Brighton district is?
second_most_common_offence = "Fraud and law violations"
Second, based on the Haversine distance function defined below, the head of the Police Department would like to know, for each district, the average distance between the location of the incident and the police station.
📝 Construct a DataFrame with one row per police district, designated by its full name, and one column that displays this average Haversine distance.
# Haversine distance function
from math import radians, sin, cos, asin, sqrt
def haversine_distance(lon1, lat1, lon2, lat2):
"""
Compute distance (km) between two pairs of (lat, lng) coordinates
See - (https://en.wikipedia.org/wiki/Haversine_formula)
"""
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
return 2 * 6371 * asin(sqrt(a))
# Distance (km) between each incident and its district's police station
# (note that haversine_distance takes longitudes first)
data["DISTANCE_KM"] = data.apply(
    lambda row: haversine_distance(
        row["LONG"], row["LAT"],
        row["LONG_POLICE_STATION"], row["LAT_POLICE_STATION"]
    ),
    axis=1,
)
# One row per district, with the average Haversine distance
avg_distance_df = data.groupby("NAME")[["DISTANCE_KM"]].mean()
avg_distance_df
❓ Can you tell what the average distance between the police station and the offenses in the Brighton district is?
average_distance_km = avg_distance_df.loc["Brighton", "DISTANCE_KM"]
average_distance_km
result = ChallengeResult('manipulation',
second_most_common_offence=second_most_common_offence,
average_distance_km=average_distance_km)
result.write()

# Number of reported incidents per district
import pandasql as ps
q1 = """SELECT NAME, COUNT(OFFENSE_CODE_GROUP) as Number_of_offense
FROM data
GROUP BY NAME"""
top_crime_reported = ps.sqldf(q1, locals())
top_crime_reported
| NAME | Number_of_offense | |
|---|---|---|
| 0 | Brighton | 13788 |
| 1 | Charlestown | 4765 |
| 2 | Dorchester | 32875 |
| 3 | Downtown | 26260 |
| 4 | East Boston | 9691 |
| 5 | Hyde Park | 12551 |
| 6 | Jamaica Plain | 12802 |
| 7 | Mattapan | 28331 |
| 8 | Roxbury | 38877 |
| 9 | South Boston | 16617 |
| 10 | South End | 31258 |
| 11 | West Roxbury | 9406 |
import seaborn as sns
ax = sns.barplot(x="NAME", y="Number_of_offense", data = top_crime_reported)
ax.set(xlabel='Crime district', ylabel='# of Crimes reported')
ax.set_xticklabels(top_crime_reported['NAME'],rotation=90)
plt.show()
# Number of incidents per offense type
import pandasql as ps
q1 = """SELECT OFFENSE_CODE_GROUP, COUNT(OFFENSE_CODE_GROUP) as Number_of_offense
FROM data
GROUP BY OFFENSE_CODE_GROUP"""
top_offense_types = ps.sqldf(q1, locals())
top_offense_types
| OFFENSE_CODE_GROUP | Number_of_offense | |
|---|---|---|
| 0 | Disputes | 15421 |
| 1 | Drugs and disorderly conduct | 19022 |
| 2 | Fraud and law violations | 26801 |
| 3 | Larceny and vandalism | 76038 |
| 4 | Other | 19399 |
| 5 | Police investigation procedure | 50520 |
| 6 | Violence and harassment | 30020 |
import seaborn as sns
ax = sns.barplot(x="OFFENSE_CODE_GROUP", y="Number_of_offense", data=top_offense_types)
ax.set(xlabel='Offense type', ylabel='# of Crimes reported')
ax.set_xticklabels(top_offense_types['OFFENSE_CODE_GROUP'], rotation=90)
plt.show()
! pip install folium
# Create basic Folium crime map
import folium
from folium.plugins import HeatMap
crime_map = folium.Map(location=[42.3125,-71.0875],
tiles = "Stamen Toner",
zoom_start = 11)
# Add data for the heatmap
# data_heatmap = data[data.Year == 2017]
data_heatmap = data.dropna(axis=0, subset=['LAT', 'LONG'])
data_heatmap = data_heatmap[['LAT', 'LONG']].values.tolist()
HeatMap(data_heatmap, radius=10).add_to(crime_map)
# Plot!
crime_map
Among serious crimes, we observed a predominance of larceny and vandalism. It would therefore be worth setting up a team dedicated to this type of crime, in order to design solutions that reduce its prevalence.
We also observed a concentration of serious crimes in certain districts, namely Roxbury, South End and Mattapan. It would therefore make sense to increase patrols in these areas, so that officers can intervene in greater numbers and more quickly.
This EDA only scratches the surface of the dataset:
Using the latest dataset that you loaded, your mission is now to prepare 5 slides (including a title slide) that you would present to the head of the Police Department. You may or may not, as you prefer, include the output of the two "Further data manipulations" tasks in your presentation.
⚠️ You can use any presentation editor of your choice, but the slides must be shared either in HTML or in PDF format and saved in the current directory
Before you get started, here are four small pieces of advice:
- The resample method might be useful if you want to plot time trends.

Good luck in changing Boston residents' lives!
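The resample hint above can be sketched as follows, on synthetic timestamps (the column name matches the exercise; the dates are made up):

```python
import pandas as pd

# Toy stand-in for the incidents DataFrame: 120 incidents, one every 6 hours
rng = pd.date_range("2015-06-01", periods=120, freq="6h")
toy = pd.DataFrame({"OCCURRED_ON_DATE": rng})

# Weekly number of incidents, ready for a line plot of the time trend
weekly = (
    toy.set_index("OCCURRED_ON_DATE")
       .resample("W")
       .size()
)
print(weekly)
```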
Congratulations on completing the exercise 🎉
If you wish to pursue your analysis at some point, note that all datasets (and many others) are publicly available online, on the Analyze Boston website.
Besides, if you are interested in the topic, you can start by reading the work of Nobel Prize laureate Gary Becker, who was the first to model crime as a rational phenomenon, similarly to an economic decision. This model, although it has limitations, marked a breakthrough in the study of crime and paved the way for many empirical studies that further analysed the socio-economic determinants of illegal activities.
👉 Link to download a full-text version of "Crime and Punishment: An Economic Approach" by Becker (1968)